Search CORE

9 research outputs found

Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

Author: Chakravarthi Bharathi Raja
McCrae John P.
Muralidaran Vigneshwaran
Priyadharshini Ruba
Publication date: 30/05/2020

arXiv.org e-Print Archive

Irish Universities

Access to Research at National University of Ireland, Galway

DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

Author: Chakravarthi Bharathi Raja
Jose Navya
McCrae John P.
Muralidaran Vigneshwaran
Priyadharshini Ruba
Sherly Elizabeth
Suryawanshi Shardul
Publication date: 17/06/2021

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification and covers a total of more than 60,000 YouTube comments. It consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning methods. The dataset is available on GitHub (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858#.YJtw0SYo_0M). Comment: 36 pages.

arXiv.org e-Print Archive

Online Research @ Cardiff

PubMed Central

Findings of the VarDial Evaluation Campaign 2021

Author: Chakravarthi Bharathi Raja
Găman Mihaela
Ionescu Radu Tudor
Jauhiainen Heidi
Jauhiainen Tommi
Lindén Krister
Ljubešić Nikola
Partanen Niko
Priyadharshini Ruba
Purschke Christoph
Rajagopal Eswari
Scherrer Yves
Zampieri Marcos
Publication date: 01/01/2021

Open Repository and Bibliography - Luxembourg

Corpus creation for sentiment analysis in code-mixed Tamil-English text

Author: Chakravarthi Bharathi Raja
McCrae John P.
Muralidaran Vigneshwaran
Priyadharshini Ruba
Publication venue: European Language Resources Association (ELRA)
Publication date: 24/07/2020

Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they often mix more than one language, frequently written in non-native scripts. The lack of annotated code-mixed data for a low-resourced language like Tamil adds to the difficulty of this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark. This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight), SFI/12/RC/2289 P2 (Insight 2), co-funded by the European Regional Development Fund as well as by the EU H2020 programme under grant agreements 731015 (ELEXIS-European Lexical Infrastructure), 825182 (Prêt-à-LLOD), and Irish Research Council grant IRCLA/2017/129 (CARDAMOM-Comparative Deep Models of Language for Minority and Historical Languages). Non-peer-reviewed.

Access to Research at National University of Ireland, Galway

Corpus creation for sentiment analysis in code-mixed Tamil-English text

Author: Chakravarthi Bharathi Raja
McCrae John P.
Muralidaran Vigneshwaran
Priyadharshini Ruba
Publication venue: European Language Resources Association (ELRA)
Publication date: 27/07/2020

Irish Universities

Offensive language identification in Dravidian languages using MPNet and CNN

Author: Chakravarthi Bharathi Raja
Jagadeeshan Manoj Balaji
Palanikumar Vasanth
Priyadharshini Ruba
Publication venue: Elsevier BV
Publication date: 01/04/2023

Social media has effectively replaced traditional forms of communication and marketing. As these platforms allow for the free expression of ideas and facts through text, images, and videos, there is a significant need to screen them to safeguard people and organisations from objectionable information directed at them. Our work aims to categorise code-mixed social media comments and posts in Tamil, Malayalam, and Kannada as offensive or not offensive at different levels. We present a multilingual MPNet and CNN fusion model for detecting offensive language content directed at an individual (or group) in low-resource Dravidian languages at different levels. Our model is capable of handling code-mixed data, such as text that combines Tamil and Latin scripts. The model was successfully validated on the datasets, achieving offensive language detection results better than those of other baseline models, with weighted average F1-scores of 0.85, 0.98, and 0.76, and outperformed the baseline models EWDT and EWODT by 0.02, 0.02, and 0.04 for Tamil, Malayalam, and Kannada respectively.

Directory of Open Access Journals

Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription

Author: Arcan Mihael
Chakravarthi Bharathi Raja
Jayapal Arun
McCrae John P.
Priyadharshini Ruba
Sridevy S.
Stearns Bernardo
Zarrouk Manel
Publication venue: European Association for Machine Translation
Publication date: 20/08/2019

Multimodal machine translation is the task of translating from a source text into the target language using information from other modalities. Existing multimodal datasets have been restricted to only highly resourced languages. In addition, these datasets were collected by manual translation of English descriptions from the Flickr30K dataset. In this work, we introduce MMDravi, a Multilingual Multimodal dataset for under-resourced Dravidian languages. It comprises 30,000 sentences which were created utilizing several machine translation outputs. Using data from MMDravi and a phonetic transcription of the corpus, we build a Multilingual Multimodal Neural Machine Translation system (MMNMT) for closely related Dravidian languages to take advantage of the multilingual corpus and other modalities. We evaluate the translations generated by the proposed approach against a human-annotated evaluation dataset in terms of BLEU, METEOR, and TER metrics. Relying on multilingual corpora, phonetic transcription, and image features, our approach improves the translation quality for the under-resourced languages. This work is supported by a research grant from Science Foundation Ireland, co-funded by the European Regional Development Fund, for the Insight Centre under Grant Number SFI/12/RC/2289 and the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731015, ELEXIS - European Lexical Infrastructure and grant agreement No 825182, Prêt-à-LLOD. Non-peer-reviewed.

Irish Universities

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Access to Research at National University of Ireland, Galway

Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription

Author: Arcan Mihael
Chakravarthi Bharathi Raja
Jayapal Arun
McCrae John P.
Priyadharshini Ruba
Sridevy S.
Stearns Bernardo
Zarrouk Manel
Publication venue: European Association for Machine Translation
Publication date: 10/09/2019

Irish Universities

Findings of the VarDial Evaluation Campaign 2021

Author: Chakravarthi Bharathi Raja
Găman Mihaela
Ionescu Radu Tudor
Jauhiainen Heidi
Jauhiainen Tommi
Lindén Krister
Ljubešić Nikola
Partanen Niko
Priyadharshini Ruba
Purschke Christoph
Rajagopal Eswari
Scherrer Yves
Zampieri Marcos
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2021

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time and the other three continued a series of tasks from previous evaluation campaigns. Non-peer-reviewed.

Helsingin yliopiston digitaalinen arkisto

Open Repository and Bibliography - Luxembourg